Abstract

Decision theory does not traditionally include uncertainty
over utility functions. We argue that a person’s utility
value for a given outcome can be treated as we treat other
domain attributes: as a random variable with a density
function over its possible values.

We  show  that  we  can  apply  statistical  density  estima¬ 
tion  techniques  to  learn  such  a  density  function  from  a 
database  of  partially  elicited  utility  functions.  In  par¬ 
ticular,  we  define  a  Bayesian  learning  framework  for 
this  problem,  assuming  the  distribution  over  utilities 
is  a  mixture  of  Gaussians,  where  the  mixture  compo¬ 
nents  represent  statistically  coherent  subpopulations. 

We  can  also  extend  our  techniques  to  the  problem  of 
discovering  generalized  additivity  structure  in  the  util¬ 
ity  functions  in  the  population.  We  define  a  Bayesian 
model  selection  criterion  for  utility  function  structure 
and a search procedure over structures. The factorization
of the utilities in the learned model, and the generalization
obtained from density estimation, allow us to provide robust
estimates of utilities using a significantly smaller number
of utility elicitation questions.

We  experiment  with  our  technique  on  synthetic  utility 
data  and  on  a  real  database  of  utility  functions  in  the 
domain  of  prenatal  diagnosis. 

1  Introduction 

The  principle  of  maximizing  expected  utility  has  long  been 
established  as  the  guide  to  making  rational  decisions  [21]. 
It  rests  on  two  components:  probabilities  for  representing 
our  uncertainty  about  the  situation,  and  utilities  for  repre¬ 
senting  our  preferences. 

Traditional  decision  theory  ignores,  however,  any  uncer¬ 
tainty  we  may  have  about  the  utilities  of  a  given  user.  To 
apply  it,  we  need  to  acquire  the  entire  utility  function.  We 
cannot  use  any  prior  knowledge,  either  in  the  form  of  ex¬ 
perience  with  other  users  or  in  the  form  of  constraints.  By 
treating  utilities  as  random  variables,  we  can  utilize  tools 
that  have  been  used  with  great  success  when  reasoning 
about  events  in  decision  problems.  For  example,  we  can 
use  value  of  information  to  decide  whether  a  utility  elicita¬ 
tion  question  is  worth  asking  [4]. 

Before  we  can  apply  these  tools,  however,  we  need  to 
address  the  issue  of  acquiring  distributions  over  utilities. 


The  problem  of  model  acquisition  is  well-understood  in  the 
context  of  probabilistic  models,  with  a  significant  body  of 
work  both  on  eliciting  models  from  experts  and  on  learning 
from  sample  data.  By  contrast,  the  problem  of  acquiring 
utility  functions  is  not  understood  nearly  as  well.  In  some 
sense,  utility  elicitation  is  innately  harder.  There  are  no  ex¬ 
perts  to  ask  about  the  model;  every  person’s  utility  function 
may  be  different.  Thus,  in  the  traditional  approach,  each  in¬ 
dividual’s  utility  for  each  of  the  possible  outcomes  must  be 
elicited.  In  domains  involving  more  than  a  few  outcomes, 
this  elicitation  process  is  time  consuming  and  cognitively 
difficult.  It  is  also  noisy  and  prone  to  errors  [15]. 

The  use  of  structure  is  crucial  for  probabilities,  simpli¬ 
fying  both  the  model  and  the  associated  knowledge  acqui¬ 
sition  process.  Structure  also  exists  in  utilities.  Utility 
functions  can  often  be  decomposed  as  a  linear  combination 
of  subutility  functions ,  each  of  which  involves  only  a  few 
of  the  relevant  variables.  Decomposable  utility  functions 
can  be  used  to  support  more  efficient  inference  [14,  20]. 
In  principle,  as  they  require  fewer  parameters  to  be  speci¬ 
fied,  they  should  also  ease  the  knowledge  acquisition  pro¬ 
cess  [15]. 

In practice, however, decomposable utility functions are
rarely used, except in certain settings where everything
easily reduces to a common basis, such as money. The problem
is that the structure in utility functions seems elusive,
perhaps because there is little methodology for discovering
it. Several papers [9, 17] have tried to detect simple
additive decompositions in a database of elicited utility
functions using linear regression; unfortunately, additive
structure rarely seems to exist in these databases, so one
typically falls back on explicit utility elicitation for the
entire outcome space. We know of no attempts to learn more
complex utility functions from data.
Alternatively,  one  could  ask  specific  individuals  about  their 
decomposition.  However,  this  approach  is  difficult  to  im¬ 
plement.  Unlike  probabilities,  utilities  cannot  be  marginal¬ 
ized.  The  utility  of  a  specific  instantiation  of  one  state  at¬ 
tribute  does  not  have  any  intuitive  meaning  and  cannot  be 
assessed  without  making  some  assumptions  about  the  val¬ 
ues  of  other  attributes.  Thus,  the  decomposition  of  utility 
functions  is  much  less  intuitive  for  people  to  understand 
than  the  decomposition  of  probability  functions. 

In  this  paper,  we  take  a  much  more  general  approach  to 
the  problem  of  discovering  the  structure  of  utility  functions. 
We  assume  that  we  have  access  to  a  database  of  (partially) 
elicited  utility  functions  for  some  set  of  individuals.  This 
assumption  is  not  unreasonable:  many  medical  informatics 
centers  collect  large  databases  of  utility  functions  for  var¬ 
ious  decision  problems  or  for  cost-benefit  analyses  of  new 
treatments  [10, 18].  Given  such  a  database,  we  apply  statis¬ 
tical  learning  techniques  to  discover  a  decomposition  that 
fits  the  data  well.  More  specifically,  we  postulate  a  model 
where  the  population  is  comprised  of  several  statistically 
coherent  subpopulations,  or  clusters.  The  utility  functions 
in  each  cluster  are  assumed  to  be  decomposed  in  some  way, 
and  the  parameters  of  the  subutilities  are  assumed  to  come 
from  a  Gaussian  distribution.  Note  that  we  do  not  assume 
that  any  of  these  model  parameters  are  known.  We  do  not 
know  which  person  belongs  to  which  cluster,  or  even  which 
decomposition  is  used  in  the  different  clusters.  Rather,  we 
are  given  only  a  standard  database  of  fully  explicit  utility 
functions  (where  some  of  the  values  may  be  missing). 

Our  approach  allows  us  to  learn  substantially  more  ex¬ 
pressive  models  than  the  naive  linear  regression  approach, 
and  thereby  discovers  structures  that  are  invisible  to  lin¬ 
ear  regression.  Furthermore,  the  model  produced  by  our 
learning  algorithm  can  be  used  to  make  the  utility  elicita¬ 
tion  process  more  robust  and  easier  for  the  user. 

2  Factored  Utility  Functions 

The  naive  representation  of  a  utility  function  is  a  vector  of 
real  numbers,  ascribing  a  utility  to  each  possible  outcome. 
This  representation  is  quite  reasonable  in  domains  involv¬ 
ing  a  small  number  of  distinct  outcomes.  Many  real-life 
domains,  however,  involve  fairly  complex  outcomes.  Con¬ 
sider,  for  example,  the  domain  of  prenatal  testing.  Prenatal 
testing  is  intended  to  diagnose  the  presence  of  a  chromo¬ 
somal  abnormality  such  as  Down’s  syndrome  in  the  early 
weeks  of  pregnancy,  an  event  whose  probability  increases 
with  maternal  age.  The  two  tests  currently  available  to 
diagnose  it,  chorionic  villus  sampling  (CVS)  and  amnio¬ 
centesis  (AMNIO),  carry  a  significant  risk  of  miscarriage 
above  the  baseline  rate.  The  risk  is  higher  for  CVS,  but  it 
is  more  accurate  and  can  be  performed  earlier  in  the  preg¬ 
nancy.  Both  miscarriage  and  elective  termination  of  the 
pregnancy  may  reduce  the  chances  of  future  pregnancy.  In 
this  domain,  a  typical  outcome  is  “healthy  fetus,  early  test 
(CVS),  accurate  test  result,  procedure-related  miscarriage, 
no  future  pregnancy”. 

In  such  cases,  it  is  convenient  to  describe  the  space  of 
outcomes  as  the  set  of  possible  assignments  of  values  to 
a set of relevant variables. Here, we have five utility
attributes: testing T (none, CVS or amniocentesis), fetus’s
status D (normal, affected by Down’s syndrome), possible
loss of pregnancy L (no loss, miscarriage, elective
termination), knowledge of the fetus’s status K (none,
accurate, inaccurate), and future successful pregnancy F
(true, false). The utility is a function of all of these
values. In general, we define each outcome as an assignment
to a set of attribute variables $\mathbf{X} = \{X_1, \ldots, X_n\}$.
Each variable $X_i$ has a domain $\mathrm{Dom}(X_i)$ of two or
more elements.

Clearly,  the  number  of  outcomes  is  exponential  in  the 
number  of  attributes.  Thus,  the  specification  of  the  util¬ 
ity  function  in  full  can  become  expensive.  In  many  med¬ 
ical  domains,  there  are  tens  of  outcomes.  In  our  domain, 
there  are  108  distinct  outcomes;  even  after  simplification 
and  elimination  of  very  unlikely  outcomes,  66  outcomes 
remain.  Utility  elicitation,  which  in  the  best  of  cases  is  a 
long  and  tiring  process,  is  extremely  difficult  for  outcome 
spaces  of  this  size.¹
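The count of 108 outcomes can be checked by enumerating the joint assignments to the five attributes. The sketch below does this; the name `DOMAINS` and the exact value labels are illustrative choices paraphrased from the attribute list above, not notation from the paper.

```python
from itertools import product

# Attribute domains as listed in the text (labels are illustrative).
DOMAINS = {
    "T": ["none", "CVS", "amniocentesis"],
    "D": ["normal", "Down's syndrome"],
    "L": ["no loss", "miscarriage", "elective termination"],
    "K": ["none", "accurate", "inaccurate"],
    "F": [True, False],
}

# Every outcome is one joint assignment to all five attributes.
OUTCOMES = [dict(zip(DOMAINS, values)) for values in product(*DOMAINS.values())]
NUM_OUTCOMES = len(OUTCOMES)  # 3 * 2 * 3 * 3 * 2
```

The enumeration makes the exponential growth concrete: adding one more ternary attribute would triple the number of outcomes to be elicited.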

In  many  cases,  however,  the  utility  function  is  not  a 
single  amorphous  function  over  the  space  of  outcomes. 
Rather,  it  exhibits  some  structure.  One  particularly  impor¬ 
tant  subclass  of  utility  functions  are  those  that  decompose 
into  components  associated  with  smaller  sets  of  attributes. 
For  example,  in  a  vacation  planning  domain,  we  might  be 
able  to  construct  our  overall  utility  as  a  sum  of  functions 
associated  with  the  cost  of  the  vacation,  with  the  weather 
in  our  destination,  with  the  quality  of  the  accommodations, 
etc.  This  type  of  decomposition  lies  at  the  heart  of  multi- 
attribute  utility  theory  [15]. 

Definition 2.1: Let $\mathcal{C}$ be a set of clusters of variables
$C_1, \ldots, C_r$. We say that a utility function $u$ is factored
according to $\mathcal{C}$ if there exist functions
$u_i : \mathrm{Dom}(C_i) \to \mathbb{R}$ ($i = 1, \ldots, r$) such that
$u(\mathbf{x}) = \sum_{i=1}^{r} u_i(\mathbf{c}_i)$, where $\mathbf{c}_i$ is the
assignment to the variables in $C_i$ in $\mathbf{x}$. We call the
functions $u_i$ subutility functions.
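Definition 2.1 can be illustrated with a minimal sketch that evaluates a factored utility as a sum of subutility functions. The clusters {A, B} and {B, C}, the functions `u_ab` and `u_bc`, and their numeric values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Dict[str, str]
# A subutility is a cluster of variable names plus a function of their values.
Subutility = Tuple[Tuple[str, ...], Callable[..., float]]


def factored_utility(outcome: Outcome, subutilities: List[Subutility]) -> float:
    """Return u(x) = sum_i u_i(c_i), where c_i restricts x to cluster C_i."""
    total = 0.0
    for cluster, u_i in subutilities:
        c_i = tuple(outcome[var] for var in cluster)  # assignment to C_i in x
        total += u_i(*c_i)
    return total


# Hypothetical subutilities over the overlapping clusters {A, B} and {B, C}.
def u_ab(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def u_bc(b: str, c: str) -> float:
    return 2.0 if (b, c) == ("yes", "no") else 0.5


SUBUTILITIES: List[Subutility] = [(("A", "B"), u_ab), (("B", "C"), u_bc)]
EXAMPLE = factored_utility({"A": "yes", "B": "yes", "C": "no"}, SUBUTILITIES)
```

Because the two clusters share the variable B, this toy function is factored but not additive over disjoint clusters.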

The  factorization  of  the  utility  function  induces  observ¬ 
able  patterns  for  the  utilities  of  related  outcomes.  Some  of 
these  cases  have  received  a  lot  of  attention  in  the  literature. 
For  example,  if  the  clusters  are  disjoint,  then  the  change 
in  the  utility  resulting  from  changing  the  assignment  to  the 
variables  in  one  cluster  does  not  depend  on  the  assignments 
to  the  variables  in  the  other  clusters.  In  this  case,  the  util¬ 
ity  function  is  said  to  be  additive  over  C.  The  intuitive 
behavior  induced  by  additive  utility  functions  makes  them 
relatively  easy  to  describe  to  a  user  and  to  test  for  during 
the  process  of  utility  elicitation. 

A related concept is that of conditionally additive utility
functions. Let Y, Z, V be a disjoint partition of X. We say
that Y and V are conditionally additively independent given
Z if, for any fixed value z of Z, we have that Y and V
are additively independent in the utility function $u(Y, V, z)$.
This type of decomposition is also relatively easy to test for,
and hence is usable.

¹ In this prenatal testing domain, the speed of utility
elicitation was around 10 outcomes per hour [16]. We were also
told by several utility elicitation practitioners that the
probability of inconsistent answers rises sharply after the
first few questions as fatigue grows.


However, the definition of factored utility functions covers
many more cases than these special cases. Consider, for
example, a set of clusters $\mathcal{C}$ consisting of the three
clusters {A,B}, {B,C}, {C,A}. As pointed out by Bacchus and
Grove [1], a utility function that factorizes in this way does
not have any of the commonly defined additive independence
properties. They call such models generalized additively
independent. They go on to say that, while utility
functions that factorize in this way may well be useful in
practice, their lack of intuitive semantics makes them hard
to incorporate into a utility elicitation process.

Factored  utility  functions  can  be  incorporated  very  nat¬ 
urally  into  influence  diagrams  [13].  Moreover,  a  factored 
utility  function  can  be  exploited  by  standard  clique  tree  in¬ 
ference  algorithms  to  make  decision  making  more  efficient, 
in  much  the  same  way  as  factored  probability  distributions 
are  exploited  in  Bayesian  network  inference  [14,  20]. 

Factored  utilities  admit  a  representation  in  terms  of 
subutility  functions  over  a  much  smaller  domain.  They  can 
therefore  be  specified  using  a  much  smaller  set  of  param¬ 
eters.  However,  there  are  many  slightly  different  ways  to 
parameterize  a  factored  utility  function  over  C.  We  choose 
one  that  will  allow  us  to  make  our  learning  algorithm  more 
efficient. 


Definition 2.2: We say that two functions $h, h'$ over some
domain $\Omega$ are orthogonal if
$\sum_{\omega \in \Omega} h(\omega)\, h'(\omega) = 0$.

Our goal will be to construct a fixed basis $\mathcal{H}[\mathcal{C}]$ of
orthogonal functions, and represent a factored utility function $u$
over $\mathcal{C}$ as a linear combination of the functions in this basis.
The coefficients $\mathbf{w}$ of the different basis functions would be
the parameters specifying $u$. The orthogonality property
will allow us to perform the computation described in the
subsequent sections more efficiently.

The atomic units in the construction of our basis are the
basis functions that depend only on a single variable. For
each variable $X$ with values $x_1, \ldots, x_k$, we define a set of $k$
basis functions $h_1^X, \ldots, h_k^X : \mathrm{Dom}(X) \to \mathbb{R}$.
Our construction is such that:

•  $h_1^X \equiv 1$, i.e., $h_1^X(x_i) = 1$ for all $i$;

•  the $h_i^X$ functions are pairwise orthogonal.

For a binary-valued attribute $B$, we simply define:

$h_2^B(x_1) = 1, \quad h_2^B(x_2) = -1$

For a three-valued attribute $C$, we define:

$h_2^C(x_1) = 1, \quad h_2^C(x_2) = 0, \quad h_2^C(x_3) = -1$

$h_3^C(x_1) = 1, \quad h_3^C(x_2) = -2, \quad h_3^C(x_3) = 1$

In general, we can define a set $\mathcal{H}[X]$ of orthogonal basis
functions for any $k$-ary variable $X$. Note that, as the $k$
functions are orthogonal, they span the space of all possible
functions over $X$. In other words, for every function
$u : \mathrm{Dom}(X) \to \mathbb{R}$, there exist coefficients
$w_1, \ldots, w_k$ such that $u = \sum_{i=1}^{k} w_i h_i^X$.

We now use these basic building blocks to construct an
orthogonal basis for functions over the entire set of outcomes.
With a slight abuse of notation, we will view a
function $h_i^X$ as a function over $\mathrm{Dom}(\mathbf{X})$. Let $o$ be an
outcome; recall that $o$ defines a value $X[o]$ for each variable
$X \in \mathbf{X}$. We simply define $h_i^X(o) = h_i^X(X[o])$.

We can now define a basis for a cluster of variables $C$
as the set of all functions that are composed as products of
basis functions for the individual variables in $C$:

$\mathcal{H}[C] = \left\{ \prod_{X \in C} h^X : h^X \in \mathcal{H}[X] \right\}$.

Proposition 2.3: The functions in $\mathcal{H}[C]$ are pairwise
orthogonal and the set $\mathcal{H}[C]$ exactly spans the set of all
possible functions over $C$.

By taking the union of the bases for the appropriate clusters,
we can span any set of factored utility functions.

Corollary 2.4: Let $\mathcal{C}$ be a set of clusters. The set of
functions $\mathcal{H}[\mathcal{C}] = \bigcup_{C \in \mathcal{C}} \mathcal{H}[C]$ spans the set
of all factored utility functions over $\mathcal{C}$.

We can therefore parameterize any factored utility function
over $\mathcal{C}$ using a set of coefficients $w_i$, one for every
function in $\mathcal{H}[\mathcal{C}]$. How many parameters are required?
For each cluster $C_i$, we have $|\mathrm{Dom}(C_i)|$ functions in $\mathcal{H}[C_i]$.
However, the bases for the different clusters are not disjoint.

Example 2.5: Assume that our clusters are {A}, {B,C}
and {C,D}, and that all of our variables are ternary. We
have 3 functions in $\mathcal{H}[A]$, and 9 in each of $\mathcal{H}[\{B,C\}]$ and
$\mathcal{H}[\{C,D\}]$. However, the $h_1$ (all 1) function is common to
all clusters, and the three $h^C$ functions are common to the
two clusters that contain C. Of course, we must be careful
not to undercount by double-counting the overlap: $h_1$ is also
among the three functions in $\mathcal{H}[C]$. A careful count reveals
that the total number of distinct functions in our basis is
$3 + 9 + 9 - 3 - 1 - 1 + 1 = 17$.

In general, we can compute the total number of distinct
functions in our basis by a simple inclusion-exclusion formula,
keeping in mind that the overlap between the bases
for two clusters $C$ and $C'$ is precisely the basis for $C \cap C'$
(taking $\mathcal{H}[\emptyset]$ to be the single vector $h_1$):

$|\mathcal{H}[\mathcal{C}]| = \sum_{i} |\mathcal{H}[C_i]| - \sum_{i_1 < i_2} |\mathcal{H}[C_{i_1} \cap C_{i_2}]| + \sum_{i_1 < i_2 < i_3} |\mathcal{H}[C_{i_1} \cap C_{i_2} \cap C_{i_3}]| - \cdots$

Thus, the total number of basis functions, and thereby of
parameters required, grows (at most) linearly with the number
of clusters and exponentially with the size of each one.
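The count in Example 2.5 can be reproduced mechanically. The following sketch, under the assumption that every variable is ternary and uses the three-valued basis given above, represents each basis function by its vector of values over all 81 joint outcomes of A, B, C, D, takes the union over clusters, and checks both the number of distinct functions and their pairwise orthogonality. The helper names are illustrative.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

# Single-variable basis for a ternary attribute, as given in the text:
# h_1 = (1, 1, 1), h_2 = (1, 0, -1), h_3 = (1, -2, 1).
TERNARY_BASIS: List[Tuple[int, ...]] = [(1, 1, 1), (1, 0, -1), (1, -2, 1)]


def cluster_basis(cluster: Sequence[str],
                  variables: Sequence[str]) -> Set[Tuple[int, ...]]:
    """Return H[C]: products of single-variable basis functions,
    each stored as a value vector over the full outcome space."""
    outcomes = list(product(range(3), repeat=len(variables)))
    positions = [variables.index(v) for v in cluster]
    functions = set()
    for choice in product(range(3), repeat=len(cluster)):
        values = []
        for o in outcomes:
            value = 1
            for pos, idx in zip(positions, choice):
                value *= TERNARY_BASIS[idx][o[pos]]
            values.append(value)
        functions.add(tuple(values))
    return functions


def union_basis(clusters: Sequence[Sequence[str]],
                variables: Sequence[str]) -> Set[Tuple[int, ...]]:
    """Return the union of the cluster bases; shared functions appear once."""
    basis: Set[Tuple[int, ...]] = set()
    for cluster in clusters:
        basis |= cluster_basis(cluster, variables)
    return basis


def is_pairwise_orthogonal(functions: Set[Tuple[int, ...]]) -> bool:
    """Check that every pair of distinct functions has zero inner product."""
    fs = list(functions)
    return all(
        sum(a * b for a, b in zip(fs[i], fs[j])) == 0
        for i in range(len(fs)) for j in range(i + 1, len(fs))
    )


VARIABLES = ["A", "B", "C", "D"]
CLUSTERS = [["A"], ["B", "C"], ["C", "D"]]
BASIS = union_basis(CLUSTERS, VARIABLES)
NUM_PARAMETERS = len(BASIS)
```

Storing the functions in a set removes the shared functions automatically, which is the same overlap that the inclusion-exclusion formula accounts for.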


3  The  Basic  Framework 

Our approach relies on a few basic assumptions about the
population of users whose utility we are trying to model.
The first assumption is that the population is composed of
several disjoint subpopulations, or types (which we model
using a random variable $T$), where the utility functions of
the individuals of each type are statistically similar. Each
subpopulation may utilize a different factorization $\mathcal{C}_t$ of the
utility function. Thus, every individual is associated with a
vector $\mathbf{w}_t$, of dimension $m_t = |\mathcal{H}[\mathcal{C}_t]|$, where each $w_j$ is the
coefficient of the $j$th basis function $h_j \in \mathcal{H}[\mathcal{C}_t]$. The vector
$\mathbf{w}_t$ represents the user’s subutility functions.

We represent a probabilistic model over utilities by
defining a vector random variable $\mathbf{W}_t$. For each value $t$
of $T$, $P(\mathbf{W}_t \mid t)$ is a multivariate Gaussian with mean vector
$\mu_t$ and covariance matrix $\Sigma_t$. We assume that individuals
in the population are IID samples from the $P(\{\mathbf{W}_t\}_t \mid T)$
distribution.

An individual’s subutility vector $\mathbf{w}_t$ defines a complete
utility function, which specifies a utility for each of the
$n = |\mathrm{Dom}(\mathbf{X})|$ outcomes $o$. We can define this implicit
utility function using a simple matrix operation. Let $A_t$ be
the $n \times m_t$ matrix $(a_{ij})$ where $a_{ij} = h_j(o_i)$ for $o_i$ the $i$th
possible outcome. Then, the user’s utility function ought
to be $\mathbf{u}^* = A_t \mathbf{w}_t$. However, the utility elicitation process
can be quite noisy. We account for that by assuming
that the user’s actual utility vector $\mathbf{u}$ is modified by some
white noise, i.e., for each $o$, we have that $u_o$ is $u^*_o$ plus some
random white noise $\epsilon$, sampled from a zero-mean Gaussian
distribution with some variance $\sigma_t^2$. More formally, we
have a vector random variable $\mathbf{U}$ of dimension $n$, which is
a linear Gaussian whose mean is $A_t \mathbf{W}_t$ and whose covariance
is $\sigma_t^2 I$, where $I$ is the identity matrix.

Note that, for each type $t$, the distribution over $\mathbf{W}_t, \mathbf{U}$ is
a simple multivariate Gaussian, defined using a Gaussian
distribution over $\mathbf{W}_t$ and a conditional linear Gaussian for
$\mathbf{U}$ given $\mathbf{W}_t$. However, the distribution as a whole is not
exactly a mixture of linear Gaussians, as the dimension of
the vector $\mathbf{w}_t$ varies for the different types.

A model such as this can be used for several purposes.
The most basic use is to compute the most probable factored
utility function for a given user. More precisely, assume
we are given a vector $\mathbf{u}$ representing the full utility
function elicited from a certain user. Our goal is to compute
the type $t$ and vector $\mathbf{w}_t$ such that the probability
$P(\mathbf{w}_t \mid \mathbf{u}, t)$ is maximized. We perform a separate
computation for each $t$.

From the definition of our generative model, we have
that:

$P(\mathbf{w}_t \mid \mathbf{u}, t) = \frac{P(\mathbf{u} \mid \mathbf{w}_t)\, P(\mathbf{w}_t \mid t)}{P(\mathbf{u} \mid t)}$.

The denominator is a constant, so it does not affect the choice
of maximum. Furthermore, the individual components $U_o$ of the
vector variable $\mathbf{U}$ are conditionally independent given $\mathbf{W}_t$,
so that our goal is to maximize
$\left(\prod_o P(u_o \mid \mathbf{w}_t)\right) \cdot P(\mathbf{w}_t \mid t)$.
Maximizing this function is equivalent to minimizing an error
function corresponding to its negative logarithm [2]:
$-\sum_o \ln P(u_o \mid \mathbf{w}_t) - \ln P(\mathbf{w}_t \mid t)$.
The first term in our error function (for the given vector
$\mathbf{u}$) can be simplified to

$\frac{1}{2\sigma_t^2} \sum_o \left((A_t)_o \mathbf{w}_t - u_o\right)^2 + n \ln \sigma_t + \frac{n}{2} \ln(2\pi)$    (1)

where $(A_t)_o$ is the row of the matrix $A_t$ that corresponds to
the outcome $o$. Simplifying $-\ln P(\mathbf{w}_t \mid t)$, we get:

$\frac{m_t}{2} \ln(2\pi) + \frac{1}{2} \ln |\Sigma_t| + \frac{1}{2} (\mathbf{w}_t - \mu_t)^\top \Sigma_t^{-1} (\mathbf{w}_t - \mu_t)$.    (2)

If we put together (1) and (2), and eliminate terms that do
not depend on $\mathbf{w}_t$, we get as our final error function:

$E(\mathbf{w}_t) = \frac{1}{2\sigma_t^2} \|A_t \mathbf{w}_t - \mathbf{u}\|^2 + \frac{1}{2} \|B_t \mathbf{w}_t - B_t \mu_t\|^2$

where $B_t^\top B_t = \Sigma_t^{-1}$. (We are guaranteed that such a
decomposition exists because the covariance matrix of a Gaussian
is guaranteed to be positive definite.)

Thus, maximizing the posterior probability of the vector
$\mathbf{w}_t$ is equivalent to minimizing a squared-error function.
Let $D_t$ be the $(n + m_t) \times m_t$ matrix obtained by concatenating
the matrices $\frac{1}{\sigma_t} A_t$ and $B_t$. We also define a vector $\mathbf{u}'$ of
length $n + m_t$ by concatenating $\frac{1}{\sigma_t} \mathbf{u}$ and $B_t \mu_t$.

Note that we designed the matrix $A_t$ to guarantee that
the columns of $D_t$ are linearly independent. Thus, we can
compute the optimal solution to the least-squares problem
by projection [19]:

$\mathbf{w}_t = (D_t^\top D_t)^{-1} D_t^\top \mathbf{u}' = \left(\frac{1}{\sigma_t^2} A_t^\top A_t + \Sigma_t^{-1}\right)^{-1} D_t^\top \mathbf{u}'$

The matrix $\left(\frac{1}{\sigma_t^2} A_t^\top A_t + \Sigma_t^{-1}\right)^{-1} D_t^\top$ does not depend on $\mathbf{u}$,
and can therefore be computed once and reused for every
individual for whom we want to estimate $\mathbf{w}_t$.

This computation gives us, for each type $t$, the most
likely vector $\mathbf{w}_t$ for the user given that the user belongs to
type $t$. We can now easily compute the most likely pair
$(t, \mathbf{w}_t)$ for this user.
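The projection step can be sketched in a few lines of plain Python. Expanding $D_t^\top \mathbf{u}'$ gives the normal equations $(A_t^\top A_t / \sigma_t^2 + \Sigma_t^{-1})\, \mathbf{w}_t = A_t^\top \mathbf{u} / \sigma_t^2 + \Sigma_t^{-1} \mu_t$, which the sketch solves directly. The function names, the small dense solver, and the numeric inputs are illustrative assumptions; the prior is passed as a precision matrix (the inverse of $\Sigma_t$) to avoid an explicit inversion.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve(m: Matrix, b: Vector) -> Vector:
    """Solve m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def map_weights(a: Matrix, u: Vector, mu: Vector,
                precision: Matrix, sigma2: float) -> Vector:
    """Minimize ||A w - u||^2 / (2 sigma2) + (w - mu)^T P (w - mu) / 2,
    where P is the prior precision (inverse covariance) matrix."""
    m, n = len(mu), len(u)
    # Normal equations: (A^T A / sigma2 + P) w = A^T u / sigma2 + P mu.
    lhs = [[sum(a[o][i] * a[o][j] for o in range(n)) / sigma2 + precision[i][j]
            for j in range(m)] for i in range(m)]
    rhs = [sum(a[o][i] * u[o] for o in range(n)) / sigma2
           + sum(precision[i][j] * mu[j] for j in range(m)) for i in range(m)]
    return solve(lhs, rhs)


# One binary attribute with basis h_1 = (1, 1), h_2 = (1, -1).
A = [[1.0, 1.0], [1.0, -1.0]]
W_HAT = map_weights(A, u=[3.0, 1.0], mu=[0.0, 0.0],
                    precision=[[1.0, 0.0], [0.0, 1.0]], sigma2=1.0)
```

With a very weak prior (precision close to zero) the estimate approaches the ordinary least-squares fit; with a stronger prior it is pulled toward the type mean, which is the smoothing effect the model is meant to provide.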

This model can also be used to give us more information.
Recall that the conditional distribution on $\mathbf{W}_t, \mathbf{U}$ is
a multivariate Gaussian distribution. At the cost of a little
more work, we can compute the entire posterior distribution
$P(\mathbf{W}_t \mid \mathbf{u}, t)$. The result would also be a Gaussian
distribution, over the variables $\mathbf{W}_t$. The mean of this
distribution would be precisely the vector $\mathbf{w}_t$ computed above.
The covariance matrix of the distribution could be used as
an indicator for how confident we are in our estimate $\mathbf{w}_t$.
Clearly, there are situations where this information can be
quite important, but it is not clear that it is always worth the
computational overhead. On the other hand, unlike projection,
this technique can be used even if some of the values
in the original utility vector are missing.

4  Model  Learning 

In  the  previous  section,  we  defined  a  statistical  model  of 
utilities  in  a  population  of  users,  and  showed  how  it  can  be 
used  to  compute  a  factorization  of  an  elicited  utility  func¬ 
tion.  We  now  move  to  tackling  the  problem  of  acquiring 
such  a  statistical  model. 

Our  goal  is  to  acquire  this  model  from  a  database  of  util¬ 
ity  functions  elicited  from  a  random  population  of  users. 
Even  if  the  utility  function  is  factored,  the  utility  elicitation 
process  is  typically  done  in  terms  of  utilities  of  full  out¬ 
comes.  This  is  certainly  the  case  if,  as  we  assumed,  the 
factorization  of  the  utility  function  is  unknown  in  advance. 
Thus, we assume that the training data we are given is a set
of utility vectors $\mathbf{u}[j]$, one for each individual. We assume
that some of the components of the utility vectors may be
missing. The type variable $T$ and the corresponding
decomposed utility vector $\mathbf{W}_t$ are unobserved in the training
data.

Our  key  subroutine  is  the  parameter  estimation  task  for 
a  given  model.  While  we  cannot  use  full  Bayesian  esti¬ 
mation  in  the  presence  of  partially  observable  data,  it  will 
nevertheless  be  useful  to  view  the  model  parameters  as  hav¬ 
ing  a  prior  and  a  posterior.  This  perspective  will  be  useful 
both  for  smoothing  our  numerical  estimates  and  to  provide 
the  appropriate  bias  for  the  structure  selection  task. 

Suppose that, for every value $t$ of the variable $T$, we have
an $m_t$-dimensional multivariate Gaussian with an unknown
mean vector $\mu_t$ and an unknown covariance matrix $\Sigma_t$. An
appropriate conjugate prior over $\mu_t$ and $\Sigma_t$ is the
Normal-Wishart [7]. We use a Normal-Wishart prior for the
parameters of each of the type-specific Gaussian distributions
over $\mathbf{W}_t$ (one for each type $t$) and for the parameters of the
conditional Gaussian over the $U_o$ given $U^*(o) = (A_t)_o \mathbf{W}_t$.
We assume that the parameters $\theta_t$ representing the prior
probability $P(T = t)$ are distributed with a Dirichlet
distribution.

The main problem is that our data is only partially
observable, rendering full Bayesian estimation infeasible. We
therefore resort to finding the MAP parameter estimate
using the expectation-maximization (EM) algorithm [8].
More precisely, we use our parameter prior to define a
Gaussian prior distribution over $\mathbf{W}_t, \mathbf{U}$. For each instance
$j$ and each type $t$, we condition this distribution on $\mathbf{u}[j]$,
and obtain a Gaussian posterior $P(\mathbf{W}_t \mid t, \mathbf{u}[j])$. We use
these Gaussian distributions to compute expected sufficient
statistics: the expected empirical means and expected
empirical covariances. These are used to update the Wishart
priors, which then generate a new Gaussian prior distribution
over $\mathbf{W}_t, \mathbf{U}$. A similar update is done to the Dirichlet
distribution over the types. The process iterates until
convergence. We describe this process in detail in Appendix A.

Now, we consider the problem of finding a good structure.
We focus on the problem of discovering the structure
of the subutility functions within the clusters, and assume
the number of clusters is given. (Our techniques easily
extend to the more standard problem of discovering the
number of clusters.) We apply Bayesian model selection to
this task. More precisely, we define a discrete variable $S$
whose states $s$ correspond to possible models, i.e., possible
decompositions of the subutilities in the different clusters;
we encode our uncertainty about $S$ with the probability
distribution $P(s)$. For each model $s$, we define a continuous
vector-valued variable $\Psi_s$ whose instantiations $\psi_s$
correspond to the possible parameters of the model. We encode
our uncertainty about $\Psi_s$ with a probability density
function $P(\psi_s \mid s)$, as described above.

We score the candidate models by evaluating the
marginal likelihood of the data set $D$ given the model
$s$ [12]. That is, we want to compute

$P(D \mid s) = \int P(D \mid \psi_s, s)\, P(\psi_s \mid s)\, d\psi_s$.

The exact computation of the marginal likelihood is
intractable for models with hidden variables. We approximate
it using a scheme introduced by Cheeseman and
Stutz [5]. This approximation is based on the fact that
$P(D \mid s)$ can be computed efficiently for complete data. If
$D_c$ is any completion of the data set $D$, we have

$P(D \mid s) = P(D_c \mid s)\, \frac{\int P(D, \psi_s \mid s)\, d\psi_s}{\int P(D_c, \psi_s \mid s)\, d\psi_s}$.

Letting $\hat{\psi}_s$ be either an MAP or an ML estimate for $\psi_s$, we
can apply the BIC/MDL approximation to the numerator
and denominator, and get:

$\log P(D \mid s) \approx \log P(D_c \mid s) + \log P(D \mid \hat{\psi}_s, s) - \log P(D_c \mid \hat{\psi}_s, s)$.


(In our case, the dimension of the complete data is the same
as the dimension of the actual data, so the model complexity
term cancels out.) We can compute the last two terms
in this estimate fairly efficiently by running our EM
algorithm from the previous section. Chickering and
Heckerman [6] showed that this approximation is surprisingly
accurate, much more so than a direct use of BIC/MDL.

The first term, $P(D_c \mid s)$, is the probability of a complete
data set, where the distribution of the continuous variables
in the network, conditioned on each instantiation of the
discrete variable Type, is a multivariate normal distribution.
Geiger and Heckerman [11] show that, in the case of complete
data, the marginal likelihood has a closed form that
decomposes (as usual) as a product over separate families
in the model. We omit the (straightforward) details.

Given  a  scoring  function,  we  can  apply  standard  tech¬ 
niques  for  finding  a  high-scoring  structure.  We  use  a  greedy 
hill-climbing  search  with  random  restarts.  Our  search 
space  operators  modify  the  subutility  structure  of  each  type 
separately.  An  operator  can  add  a  variable  to  an  existing 
subutility  function,  delete  a  variable  from  a  function,  or  in¬ 
troduce  a  new  subutility  function  with  a  single  variable.  We 
evaluate  each  candidate  successor  structure  by  running  EM 
on  it,  and  then  scoring  it  using  the  Cheeseman-Stutz  ap¬ 
proximation  to  the  Bayesian  score. 
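The three search operators and the greedy step can be sketched as follows for the structure of a single type. This is only an outline under stated assumptions: `toy_score` is a hypothetical stand-in for the Cheeseman-Stutz score (which in the paper requires running EM on each candidate), random restarts are omitted, and deleting the only variable of a single-variable subutility is disallowed here as a simplifying choice.

```python
from typing import Callable, FrozenSet, List, Set

# A structure is a set of clusters; each cluster is a set of variable names.
Structure = FrozenSet[FrozenSet[str]]


def successors(structure: Structure, variables: Set[str]) -> List[Structure]:
    """Apply each search operator once and return the distinct results."""
    result = set()
    for cluster in structure:
        rest = structure - {cluster}
        for v in variables - cluster:      # add a variable to a subutility
            result.add(rest | {cluster | {v}})
        if len(cluster) > 1:
            for v in cluster:              # delete a variable from a subutility
                result.add(rest | {cluster - {v}})
    for v in variables:                    # new single-variable subutility
        result.add(structure | {frozenset({v})})
    result.discard(structure)
    # Sort for a deterministic order (sets have no stable iteration order).
    return sorted(result, key=lambda s: sorted(sorted(c) for c in s))


def hill_climb(start: Structure, variables: Set[str],
               score: Callable[[Structure], float]) -> Structure:
    """Greedy ascent: move to the best successor until none improves."""
    current, current_score = start, score(start)
    while True:
        candidates = successors(current, variables)
        best = max(candidates, key=score, default=None)
        if best is None or score(best) <= current_score:
            return current
        current, current_score = best, score(best)


# Hypothetical score: closeness to a known target structure.
TARGET: Structure = frozenset({frozenset({"A", "B"}), frozenset({"C"})})


def toy_score(structure: Structure) -> float:
    return -float(len(structure ^ TARGET))


RESULT = hill_climb(frozenset({frozenset({"A"})}), {"A", "B", "C"}, toy_score)
```

In a fuller implementation the score of each structure would be cached, since every evaluation involves a complete EM run.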


5  Using  the  Model  for  Utility  Elicitation 

There  are  many  ways  to  use  the  model  we  learn  to  facilitate 
utility  elicitation  and  improve  the  quality  of  the  results. 

The  most  obvious  use  is  simply  to  use  the  model  as  a 
guide  to  the  range  of  utility  functions  within  the  population. 
In  particular,  our  model  incorporates  a  built-in  measure  of 
confidence.  When  we  assess  a  new  user’s  utility  function, 
we  can  immediately  discover  if  he  or  she  is  an  “outlier”  — 
a  person  with  an  atypical  utility  function.  We  can  ask  such 
a  person  additional  questions  to  make  sure  that  there  was 
no  error  in  the  process. 

A  somewhat  deeper  use  of  the  model,  along  the  same 
lines,  is  for  smoothing  the  results  of  the  utility  elicitation 
process  for  a  particular  individual  based  on  trends  in  the 
population  as  a  whole.  Given  the  amount  of  noise  in  the 
utility  elicitation  process,  smoothing  of  this  type  is  likely 
to  be  very  useful  in  getting  robust  utility  estimates. 

We  can  also  use  the  model  in  a  much  more  funda¬ 
mental  way  to  change  the  entire  utility  elicitation  process. 
For  (conditionally)  additive  decompositions,  Keeney  and 
Raiffa  [15]  describe  a  utility  elicitation  procedure  which 
exploits  the  structure  to  reduce  the  number  of  questions 
asked.  A  separate  scale  is  established  for  every  utility  func¬ 
tion  component  and  the  user  is  asked  a  series  of  questions 
about  its  parameters.  At  the  end,  a  new  set  of  assessments 
must  be  made  to  discover  the  scaling  constants.  This  pro¬ 
cedure  has  become  a  gold  standard  in  many  applications. 

This  method  cannot  take  advantage  of  the  more  general¬ 
ized  factorizations  allowed  by  our  algorithm.  We  propose 
an  alternative  procedure  which  is  general  enough  to  han¬ 
dle  all  factorizations.  When  we  assess  the  utility  function 
of  a  new  user,  we  only  need  to  ask  as  many  questions  as 
the  number  of  parameters  in  our  model.  The  simplest  way 
to  choose  the  outcomes  to  assess  is  to  convert  the  projec¬ 
tion  matrix  to  the  reduced  row  echelon  form  and  discard 
the  outcomes  corresponding  to  the  rows  consisting  entirely 
of  zeros.  Once  the  values  of  all  the  subutility  functions  are 
known,  we  can  compute  the  utility  values  for  the  remaining 
outcomes.  It  would  be  good  practice  to  double  check  that 
the  chosen  decomposition  really  matches  the  new  user’s 
utility  function  structure  by  asking  a  few  more  “redundant” 
questions  and  comparing  the  answers  with  those  predicted 
by  the  function  we  had  computed. 
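As a rough illustration of this procedure, the sketch below assumes that utilities are linear in the subutility parameters, u = A w, where A is the projection matrix with one row per outcome. Instead of an explicit reduced row echelon form, it picks a maximal set of linearly independent rows greedily using a rank test, which is a swapped-in but equivalent way of deciding which outcomes are redundant. The function names and the tiny two-attribute example are assumptions made for the sketch.

```python
import numpy as np


def choose_outcomes(A: np.ndarray, tol: float = 1e-9) -> list:
    """Greedily pick row indices of A that are linearly independent."""
    chosen, rank = [], 0
    for i in range(A.shape[0]):
        trial = chosen + [i]
        r = np.linalg.matrix_rank(A[trial], tol=tol)
        if r > rank:  # row i adds new information, so ask about outcome i
            chosen, rank = trial, r
    return chosen


def complete_utilities(A: np.ndarray, chosen: list, answers: np.ndarray) -> np.ndarray:
    """Fit the parameters to the elicited answers and predict every outcome."""
    w, *_ = np.linalg.lstsq(A[chosen], answers, rcond=None)
    return A @ w


# Fully additive toy domain with two binary attributes.
# Columns: [x1=0, x1=1, x2=0, x2=1]; rows: outcomes (0,0), (0,1), (1,0), (1,1).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
true_u = A @ np.array([0.1, 0.5, 0.2, 0.4])  # the user's (hidden) utilities

chosen = choose_outcomes(A)                   # outcomes we actually ask about
predicted = complete_utilities(A, chosen, true_u[chosen])
```

Any outcome that was not asked about can then serve as one of the "redundant" check questions: a large gap between the user's answer and `predicted` suggests that the assumed decomposition does not fit this user.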

Figure 1: Best decomposition for Visual Analog Scale (a)
and Standard Gamble (b).

This procedure can also be modified to utilize the model
in a more principled way. We can view the utilities elicited
for different outcomes as evidence in the distribution defined
by the model. We can then use standard probabilistic
inference to compute the distribution over the user’s subutility
functions. The more utilities we elicit, the more evidence
we have, and the more certain we are about the actual
value of the user’s subutility functions. We can apply techniques
such as conditional mutual information or variance
reduction to decide, at each point in time, which utility elicitation
question is likely to be the most informative about
the subutility variables. We can also make principled decisions
on when to stop the elicitation process by considering
our uncertainty about these variables.
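The variance-reduction idea can be sketched for a single type under a linear-Gaussian assumption: the subutility vector w has a Gaussian distribution N(mu, Sigma), and the answer for outcome o is a_o · w plus independent noise. These modelling choices, the function names, and the numbers below are assumptions for illustration; the full model is a mixture over types, which this sketch ignores.

```python
import numpy as np


def condition(mu, Sigma, a, u, noise_var):
    """Posterior over w after observing utility u for an outcome with row a."""
    s = a @ Sigma @ a + noise_var            # predictive variance of the answer
    gain = Sigma @ a / s
    mu_new = mu + gain * (u - a @ mu)
    Sigma_new = Sigma - np.outer(gain, a @ Sigma)
    return mu_new, Sigma_new


def next_question(Sigma, A, noise_var, asked=()):
    """Pick the unasked outcome with the largest reduction in trace(Sigma)."""
    best, best_reduction = None, -1.0
    for o in range(A.shape[0]):
        if o in asked:
            continue
        Sa = Sigma @ A[o]
        reduction = (Sa @ Sa) / (A[o] @ Sa + noise_var)
        if reduction > best_reduction:
            best, best_reduction = o, reduction
    return best, best_reduction


# Toy domain: two binary attributes, additive decomposition.
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
mu, Sigma, noise_var = np.zeros(4), np.eye(4), 0.01

o, expected_reduction = next_question(Sigma, A, noise_var)
mu_post, Sigma_post = condition(mu, Sigma, A[o], 0.5, noise_var)
actual_reduction = np.trace(Sigma) - np.trace(Sigma_post)
```

A stopping rule in this spirit would end the interview once `np.trace(Sigma_post)` (or the best achievable reduction) falls below a chosen threshold.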

Finally,  we  can  use  probabilistic  models  of  the  utility 
function  as  the  basis  for  a  more  targeted  process  of  utility 
elicitation.  In  a  given  decision  making  task,  the  utilities 
of  different  outcomes  typically  influence  the  decision,  and 
the  resulting  expected  utility,  to  radically  different  extents. 
Most  simply,  some  outcomes  may  have  very  low  probabil¬ 
ity  in  the  current  setting,  so  their  utility  is  largely  irrelevant. 
Having  a  distribution  over  the  utility  functions  in  the  pop¬ 
ulation,  we  can  compute  the  value  of  information  of  every 
elicitation  question;  we  can  then  focus  our  efforts  on  those 
that have the highest impact on our actual decision [4].

6  Experimental  Results 

We  tested  our  approach  on  both  real  and  synthetically  gen¬ 
erated  data. 

Our  primary  dataset  consists  of  utility  functions  elicited 
in  a  prenatal  diagnosis  study  performed  by  [17].  All  study 
subjects  were  recruited  from  the  University  of  Califor¬ 
nia  at  San  Francisco  (UCSF)  Prenatal  Diagnosis  Center. 
Study  subjects  were  recruited  from  a  counseling  session  for 
women  who  have  not  yet  decided  which  prenatal  diagnos¬ 
tic  test  to  undergo,  or,  in  some  cases,  whether  to  undergo 
prenatal  diagnosis  at  all. 

Out  of  70  subjects  we  selected  51  who  completed  the 
entire  interview,  which  involved  assessing  utilities  for  22 
outcomes  using  two  elicitation  methods:  standard  gamble 
(SG)  and  visual  analog  scale  (VAS).  These  two  methods 
are  known  to  produce  very  different  utility  values,  thus  we 
treated  the  two  sets  of  utilities  as  two  distinct  databases. 
We  treated  the  values  of  all  the  outcomes  the  women  were 
not  asked  about  as  missing. 

We  searched  the  space  of  1-,  2-  and  3-cluster  models. 
The  best  models  we  learned  for  our  two  databases  were  in 
both cases 3-cluster models. They are presented in Figure 1.
The nodes correspond to utility attributes in our
domain: testing (T), Down’s status (D), pregnancy loss
(L), knowledge (K) and future pregnancy (F). Additive
and conditional additive independence correspond to vertex
separation. While the size of the database does not allow
us to treat our models as representing the true structure
of the utility functions in the population, some of the correlations
found are very interesting. For example, the correlations
between the utilities for pregnancy loss and the utilities
for Down’s status and future pregnancy are highly intuitive.

Figure 2: Learning curves for several models.

We  note  that,  in  both  cases,  structures  having  multiple 
clusters  received  substantially  higher  scores  than  structures 
having  a  single  cluster.  Furthermore,  structures  where  the 
different  clusters  had  different  decompositions  scored  more 
highly  than  structures  where  all  clusters  used  the  same  de¬ 
composition.  This  supports  our  hypothesis  that  different 
subpopulations  exist,  and  have  different  decompositions. 

We  also  tested  our  algorithm  on  synthetic  data.  In  our 
artificial  domain,  we  had  3  utility  attributes,  one  ternary 
and two binary, and 12 outcomes. We had three basic
structures: fully additive; structured, in which
$u(o) = u_1(X_1, X_2) + u_2(X_2, X_3)$; and fully connected
(no independencies). We generated 10-20 distributions for each
structure, using different parameters.
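To make the three synthetic structures concrete, the sketch below builds an indicator matrix for each one, assuming a utility is the sum of one parameter per (subutility function, local assignment) pair. The function name, the 0-based attribute indexing, and the use of an indicator matrix as the projection matrix are assumptions made for this illustration.

```python
import itertools
import numpy as np


def projection_matrix(domain_sizes, factors):
    """One row per outcome, one column per (factor, local assignment)."""
    outcomes = list(itertools.product(*[range(k) for k in domain_sizes]))
    columns = [(f, vals)
               for f in factors
               for vals in itertools.product(*[range(domain_sizes[i]) for i in f])]
    A = np.zeros((len(outcomes), len(columns)))
    for r, outcome in enumerate(outcomes):
        for c, (f, vals) in enumerate(columns):
            if tuple(outcome[i] for i in f) == vals:
                A[r, c] = 1.0  # this parameter contributes to this outcome
    return A, outcomes


sizes = (3, 2, 2)  # one ternary and two binary attributes: 12 outcomes
A_additive, _ = projection_matrix(sizes, [(0,), (1,), (2,)])
A_structured, outcomes = projection_matrix(sizes, [(0, 1), (1, 2)])
A_full, _ = projection_matrix(sizes, [(0, 1, 2)])

# Sample one synthetic utility function from the structured model.
rng = np.random.default_rng(0)
w = rng.normal(size=A_structured.shape[1])
u = A_structured @ w
```

The rank of each matrix is the number of free directions the structure allows, which is smaller than the number of columns whenever subutility functions share variables.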

In one-cluster tests, we were always able to recover
the  structure  of  the  original  distribution.  For  the  addi¬ 
tive  model,  the  correct  structure  was  chosen  after  seeing 
at  most  2  data  points.  (This  result  was  to  be  expected  given 
the  well-known  bias  towards  simpler  structures  in  Bayesian 
learning.)  For  the  structured  model,  the  number  of  samples 
needed  ranged  from  100  to  750.  For  the  fully  connected 
model,  we  needed  200-500  samples. 

In  two-cluster  tests,  small  amounts  of  data  (10-100  sam¬ 
ples)  always  resulted  in  a  model  with  one  fully  connected 
and  one  fully  additive  structure,  regardless  of  the  underly¬ 
ing  distribution.  Given  more  data  (1000-5000),  we  were 
able  to  learn  either  the  correct  structure  or  one  differing  by 
only  one  variable’s  presence  or  absence  in  a  subutility  func¬ 
tion.  We  obtained  these  results  for  models  with  the  same  as 
well  as  with  differing  decompositions  in  the  different  clus¬ 
ters. 

We  also  tested  our  algorithm  as  a  density  estimator.  For 
these  tests,  we  used  a  domain  with  4  attributes,  one  ternary 
and three binary. We had two structures: one fully additive
and one structured in which
$u(o) = u_1(X_1, X_2) + u_2(X_2, X_3) + u_3(X_2, X_4)$.
We created several 1- and 2-cluster
models, with the same decomposition in different clusters
in some models and different decompositions in other models.
The learning curve tests are presented in Figure 2. As
the number of samples grows, the learned parameters generally
seem to converge to the generating distribution.

Figure 3: Least-squares projection vs. MAP projection

Finally, we tested the smoothing effect of using parameter
priors in our algorithm. After learning the parameters
of the model (using the structure from which our data was
generated), we computed the values of the weight vector
w, using both least-squares projection and MAP projection (as
described in Section 3), for the samples in our test set. We
compared these values to the true weights w used to generate
these samples. Figure 3 shows the results on 1- (solid
lines)  and  2-cluster  (dotted  lines)  structured  models.  The 
upper  curve  in  both  cases  corresponds  to  the  least-squares 
projection,  the  lower  to  MAP  projection.  The  error  for 
MAP  projection  is  not  only  lower,  it  also  decreases  more 
rapidly. 
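The comparison can be illustrated with a small simulation. The sketch assumes a single-cluster linear-Gaussian model in which w is drawn from N(mu, Sigma) and the observed utilities are u = A w plus independent noise; the least-squares projection is taken to be the pseudo-inverse solution and the MAP projection the posterior mean of w. The matrix A, the parameter values, and the function names are assumptions chosen for the example, not the experimental setup of the paper.

```python
import numpy as np


def ls_projection(A, u):
    """Least-squares estimate of w (minimum-norm solution)."""
    return np.linalg.pinv(A) @ u


def map_projection(A, u, mu, Sigma, noise_var):
    """Posterior mean of w under the Gaussian prior N(mu, Sigma)."""
    S = A @ Sigma @ A.T + noise_var * np.eye(A.shape[0])
    return mu + Sigma @ A.T @ np.linalg.solve(S, u - A @ mu)


def compare(n_samples=2000, seed=0):
    """Average squared error of both projections against the true weights."""
    rng = np.random.default_rng(seed)
    A = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0],
                  [0, 1, 0, 1]], dtype=float)
    mu = np.array([0.2, 0.6, 0.1, 0.3])
    Sigma = 0.05 * np.eye(4)
    noise_var = 0.04
    err_ls = err_map = 0.0
    for _ in range(n_samples):
        w = rng.multivariate_normal(mu, Sigma)                 # true weights
        u = A @ w + rng.normal(scale=np.sqrt(noise_var), size=4)
        err_ls += np.sum((ls_projection(A, u) - w) ** 2)
        err_map += np.sum((map_projection(A, u, mu, Sigma, noise_var) - w) ** 2)
    return err_ls / n_samples, err_map / n_samples


err_ls, err_map = compare()
```

In this toy setting the prior helps in two ways: it shrinks noisy answers toward the population mean, and it supplies information about the direction of w that the observed utilities cannot identify at all.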

7  Conclusion  and  Extensions 

This  paper  introduces  a  new  approach  to  acquiring  and  us¬ 
ing  preference  information.  Treating  utilities  as  random 
variables  allows  us  to  deal  in  a  principled  way  with  the  un¬ 
certainty  inherent  in  utility  assessments.  It  also  helps  us 
utilize  any  prior  knowledge  we  may  have. 

We  have  presented  an  algorithm  for  learning  a  proba¬ 
bilistic  model  of  the  utility  functions  in  a  population  of 
users.  Our  approach  uses  Bayesian  learning  techniques, 
and  utilizes  some  of  the  same  principles  that  have  been  used 
successfully  in  structure  search  for  probabilistic  models. 

Our  approach  allows  us  to  discover  the  factorization 
structure  of  the  utility  functions  appropriate  for  a  given  do¬ 
main.  It  accommodates  a  wide  range  of  possible  factoriza¬ 
tions,  including  those  corresponding  to  additive,  condition¬ 
ally  additive,  and  generalized  additive  independence. 

Our  approach  is  significantly  more  expressive  than  the 
naive  linear-regression  approach  in  several  respects.  First, 
it allows more general notions than simple additive independence;
these are far more realistic assumptions in many
domains. Second, it explicitly accounts for different clusters
of users that may use different decompositions. Indeed,
our approach discovers interesting structure in the prenatal
diagnosis domain of [17], where the traditional linear regression
model failed to do so.

The statistical learning perspective also has other benefits.
By learning a statistical model of utilities in the population,
we are able to associate a “confidence” with our assessment
of an individual’s utility: if it is extremely unlikely
given our model, perhaps fatigue or some other source of
noise interfered with the elicitation process. We can also
use the model to “smooth” our estimates of a user’s utility
function, reducing the effects of noise. Finally and most
importantly,  we  can  use  this  statistical  model  to  substan¬ 
tially  ease  the  elicitation  process  (see  [4]). 

There  are  several  interesting  extensions  of  this  line  of 
work  that  we  would  like  to  pursue.  So  far,  most  work  (in¬ 
cluding  ours)  has  focused  on  notions  of  independence  at  the 
level  of  variables.  In  probabilistic  settings,  this  notion  has 
been  refined  to  that  of  context-specific  independence  [3], 
which  allows  independence  of  two  variables  X  and  Y  in 
the context of a particular value z of a third variable Z, but
not in the context of a value z′ for Z. An analogous notion
can also be defined for utilities. We hope to extend
our  approach  to  handle  these  more  refined  factorizations  of 
utility  functions.  In  another  extension,  we  hope  to  capture 
relations  between  utility  variables  and  other  variables.  For 
example,  it  has  been  observed  that  people  who  have  expe¬ 
rienced  an  outcome  tend  to  assign  it  a  higher  utility  value 
than  those  for  whom  the  outcome  is  imaginary  [18].  This 
type  of  correlation  can  be  represented  very  naturally  as  a 
dependence  in  our  probabilistic  model;  we  hope  to  extend 
our  approach  to  handle  this  type  of  situation. 
A  EM Computation

A Normal-Wishart prior defines a distribution over the
mean and covariance matrix of a Normal distribution. It
is parameterized by: a precision matrix $R_t$; a number
$\rho_t > m_t - 1$; a mean vector $\lambda_t$; and a number $\nu_t > 0$.
Essentially, $R_t$ and $\rho_t$ define a Wishart distribution $w(R_t, \rho_t)$
over $m_t \times m_t$ matrices $Q_t$. The conditional distribution of $\mu_t$
given $Q_t$ is a Gaussian with mean $\lambda_t$ and covariance $\nu_t Q_t^{-1}$.
The conditional distribution of vectors y given $\mu_t$ and $Q_t$
sampled from this distribution is a Gaussian with mean $\mu_t$
and covariance $\nu_t Q_t^{-1}$.

The Normal-Wishart distribution is conjugate to the
Gaussian distribution. In other words, if we have a
Normal-Wishart prior $(R_t^0, \rho_t^0, \lambda_t^0, \nu_t^0)$, and we observe vectors
$y[1], \ldots, y[\ell]$ from the associated Gaussian, then the
posterior distribution over the parameters is also Normal-Wishart,
with the following update rule:
In our setting, we assume that the parameters of
$P(W_t \mid t)$ are distributed Normal-Wishart with parameters
We  also  assume  that  the  variance  of  as¬ 
sociated  with  all  of  the  variables  U0  is  distributed  one¬ 
dimensional  Wishart  with  parameters  pf,  y?  and  r^.  p,, 
y,  and  X[t  correspond  to  Rt ,  and  vt  in  the  distribution  over 

$W_t$, and their update rules are analogous to update rules (7),
(8) and (5), respectively.

To  do  inference  with  this  model,  we  need  to  marginalize 
out  the  parameter  prior  and  obtain  a  distribution  over  the 
domain  variables  only.  Given  a  Normal-Wishart  parameter 
distribution $(R_t, \rho_t, \lambda_t, \nu_t)$, the distribution over $W_t$ given t
is an n-dimensional t distribution, which can be approximated
using a multivariate Gaussian. For the type-specific
distributions,  we  get: 
For the variance $\sigma_t^2$, we set

$\sigma_t^2 =$


The marginalization for a Dirichlet distribution over the
type, with hyperparameters $\alpha_t$, is the standard one:
$\theta_t = \alpha_t / \sum_{t'} \alpha_{t'}$.

When applying EM to our model, the parameters to be
estimated are $\theta_t$ and $\sigma_t^2$ for every t. The hidden variables
are T and $W_T$. In order to complete the data, we
must compute $P(T[j], W_T[j] \mid u[j], \text{params})$. We marginalize
the parameter prior, as we just described. The result is
a Gaussian distribution $P(W_t, U \mid t)$. For each t, we compute
$P(W_t \mid t, u[j])$ and the marginal $P(u[j] \mid t)$. We also
compute the posterior probability of the different types as
$P(t \mid u[j]) \propto P(t) \cdot P(u[j] \mid t)$.
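A minimal sketch of this completion step for one data case follows. It assumes that the parameter priors have already been marginalized to point values, that under type t the utilities satisfy U = A_t W_t plus independent noise with variance sigma_t^2, and that all utilities in u are observed (missing values are ignored here). The function names and the two-type example are assumptions for illustration only.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))


def responsibilities(u, theta, A, mu, Sigma, noise_var):
    """P(t | u) proportional to P(t) * P(u | t), computed in log space."""
    log_post = np.array([
        np.log(theta[t]) + gaussian_logpdf(
            u, A[t] @ mu[t],
            A[t] @ Sigma[t] @ A[t].T + noise_var[t] * np.eye(len(u)))
        for t in range(len(theta))
    ])
    log_post -= log_post.max()  # guard against underflow before exponentiating
    p = np.exp(log_post)
    return p / p.sum()


def posterior_w(u, A_t, mu_t, Sigma_t, noise_var_t):
    """Mean and covariance of P(W_t | t, u), by Gaussian conditioning."""
    S = A_t @ Sigma_t @ A_t.T + noise_var_t * np.eye(len(u))
    K = np.linalg.solve(S, A_t @ Sigma_t).T  # equals Sigma_t A_t^T S^{-1}
    return mu_t + K @ (u - A_t @ mu_t), Sigma_t - K @ A_t @ Sigma_t


# Two hypothetical types with well-separated means.
A = [np.eye(2), np.eye(2)]
mu = [np.zeros(2), np.array([5.0, 5.0])]
Sigma = [np.eye(2), np.eye(2)]
theta = np.array([0.5, 0.5])
noise_var = [0.1, 0.1]

u = np.array([0.0, 0.0])
p = responsibilities(u, theta, A, mu, Sigma, noise_var)
w_mean, w_cov = posterior_w(u, A[0], mu[0], Sigma[0], noise_var[0])
```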

Using these probabilities, we can easily compute the (expected)
sufficient statistics required for the update of our
various parameter priors. For the Dirichlet, we merely need
the expected count $N(t) = \sum_j P(t \mid u[j])$. For the various
type-specific Gaussians, we must compute the expected
value of $\bar{y}_t$ and $S_t$. Intuitively, we have to take the expectation
over uncountably many “completed” data cases: a
continuum of possible completions for each j. Fortunately,
this turns out to be easy. The key is that the posterior distribution
over $W_t[j]$ given t is a multivariate Gaussian with
mean $\mu_t[j]$ and covariance $\Sigma_t[j]$. Let $\pi_t[j]$ denote $P(t \mid u[j])$;
intuitively $\pi_t[j]$ is the extent to which the jth sample belongs
to type t, and therefore the extent to which it influences
the estimate of its parameters. It is straightforward to
verify that

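$\ell_t = \sum_{j=1}^{\ell} \pi_t[j]$

$\bar{y}_t = \frac{1}{\ell_t} \sum_{j=1}^{\ell} \pi_t[j] \, \mu_t[j]$

$S_t = \sum_{j=1}^{\ell} \pi_t[j] \left( (\mu_t[j] - \bar{y}_t)(\mu_t[j] - \bar{y}_t)^T + \Sigma_t[j] \right)$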
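The expected sufficient statistics for one type can be computed in a few lines. The sketch uses the standard form for a Gaussian posterior over each completed data case: a weighted count, a weighted mean of the posterior means, and a weighted scatter matrix that adds each case's posterior covariance. Whether the paper normalizes these quantities in exactly this way is an assumption, as are the function and argument names.

```python
import numpy as np


def expected_stats(pi_t, post_means, post_covs):
    """Expected count, mean and scatter matrix for one type.

    pi_t:       (n,) responsibilities P(t | u[j])
    post_means: (n, m) posterior means of W_t[j]
    post_covs:  (n, m, m) posterior covariances of W_t[j]
    """
    ell_t = pi_t.sum()
    y_bar = (pi_t[:, None] * post_means).sum(axis=0) / ell_t
    centered = post_means - y_bar
    # Weighted outer products of the centered means ...
    scatter = np.einsum("j,ja,jb->ab", pi_t, centered, centered)
    # ... plus the weighted posterior covariances (the "inherent" uncertainty).
    S_t = scatter + np.einsum("j,jab->ab", pi_t, post_covs)
    return ell_t, y_bar, S_t


# Sanity check: with full responsibility and no posterior uncertainty the
# statistics reduce to the ordinary count, mean and scatter matrix.
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
ell_t, y_bar, S_t = expected_stats(np.ones(4), means, np.zeros((4, 2, 2)))
```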
Finally, we must compute the expected empirical variance
$s_t$ needed to update $p_t$ and in turn $\sigma_t^2$. Simple linear
algebra shows that, if $W_t$ is distributed Gaussian with mean
$\mu_t[j]$ and variance $\Sigma_t[j]$, then $U^* = A_t W_t$ is distributed
Gaussian with mean $A_t \mu_t[j]$ and variance $Y_t[j] = A_t \Sigma_t[j] A_t^T$.
Thus, we get that

S  =  i:  7Tr[y]  -  «.)2) 
and


Pr  =  P?  +  5  +  ■  X  i  n,UmMo  -  Uof. 
Essentially, the empirical variance has components for
different data cases j (which determine $P(W_t[j] \mid t)$), and
outcomes o. The contribution for a type t is weighted by its
probability. For each j and o, there is a contribution for the
difference between the mean of $U^*$ and the observed utility
for outcome o, and a contribution for the inherent variance
of $U^*$.

We  can  now  use  these  expected  sufficient  statistics  in 
place  of  the  exact  sufficient  statistics  in  Equations  (4),  (5), 
(7)  and  (8).  This  gives  us  new  estimates  of  the  posterior 
over  the  parameters  relative  to  the  completed  data.  We 
then  marginalize  the  posterior  to  induce  new  parameters, 
and  continue. 


